Skip to content

Conversation

@okJiang
Copy link
Member

@okJiang okJiang commented Feb 6, 2026

What problem does this PR solve?

Issue Number: ref #9737

To migrate RM metadata writes to PD in a controlled way, we need explicit and testable write ownership in the manager layer first. Without write-role gates, later cutover steps will mix metadata/state persistence responsibilities and are easy to regress.

What is changed and how does it work?

  • Add ResourceGroupWriteRole with three modes: LegacyAll, PDMetaOnly, RMTokenOnly.
  • Add role gate helpers (AllowsMetadataWrite, AllowsStateWrite) and errMetadataWriteDisabled.
  • Wire optional write-role provider into NewManager, defaulting to LegacyAll (no behavior change by default).
  • Guard metadata write paths (Add/Modify/DeleteResourceGroup, UpdateControllerConfigItem, SetKeyspaceServiceLimit).
  • Propagate role into keyspace manager/service limiter persistence behavior:
    • metadata persistence only when metadata writes are allowed
    • state persistence loop only when state writes are allowed
    • service-limit load path uses non-persisting apply
  • Update RM REST service-limit setter to handle the new SetKeyspaceServiceLimit error return.
  • Add unit tests for manager/keyspace write-role gates.
resourcemanager: add write roles and metadata gates

Check List

Tests

  • Unit test

Code changes

  • Increased code complexity

Related changes

Release note

None.

Summary by CodeRabbit

  • New Features

    • Added write-role based access control for resource group and service limit operations, so persistence and state updates respect configured write permissions.
    • Enhanced service limit management with explicit persist/no-persist behavior and clearer handling of limit changes.
  • Bug Fixes

    • Fixed service limit handler to surface errors correctly (including permission-denied scenarios) instead of silently failing.

Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-linked-issue dco-signoff: yes Indicates the PR's author has signed the dco. labels Feb 6, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zhouqiang-cl for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

Adds a role-based write permission model (ResourceGroupWriteRole) and enforces it across manager, keyspace, and service-limiter logic; updates APIs and tests to propagate and handle write-disabled errors and to optionally persist service-limit changes.

Changes

Cohort / File(s) Summary
Write Role Foundation
pkg/mcs/resourcemanager/server/write_role.go
New ResourceGroupWriteRole type and constants (LegacyAll, PDMetaOnly, RMTokenOnly), with AllowsMetadataWrite()/AllowsStateWrite() and errMetadataWriteDisabled + helper IsMetadataWriteDisabledError.
Service Limiter Write Control
pkg/mcs/resourcemanager/server/service_limit.go
serviceLimiter gains writeRole; newServiceLimiter accepts optional write role; added setServiceLimitNoPersist() and consolidated setServiceLimitInternal(..., persist) to respect writeRole and conditional persistence.
Keyspace Manager Write-Role Integration
pkg/mcs/resourcemanager/server/keyspace_manager.go
keyspaceResourceGroupManager gains writeRole; add/modify/delete/persist operations gated by AllowsMetadataWrite/AllowsStateWrite; new service-limit APIs (setServiceLimit, setServiceLimitFromStorage, updateServiceLimit) and cleanup/invalidate behavior when limits change.
Manager-Level Write-Role Implementation
pkg/mcs/resourcemanager/server/manager.go
Manager gains writeRole and GetWriteRole(); write-role provider wired; many persistence and init paths gated by write-role; SetKeyspaceServiceLimit signature changed to return error and manager uses role-aware storage calls.
Server and API Accessors / Handler
pkg/mcs/resourcemanager/server/server.go, pkg/mcs/resourcemanager/server/apis/v1/api.go
Server exposes GetResourceGroupWriteRole(); API handler setKeyspaceServiceLimit now surfaces manager error: if metadata-write-disabled returns HTTP 403 (with error), otherwise HTTP 500 for other errors; success unchanged.
Tests
pkg/mcs/resourcemanager/server/manager_test.go
Mocks updated for write-role (GetResourceGroupWriteRole() / mockRoleConfigProvider); tests adjusted to assert behavior under different write roles (allowed vs write-disabled scenarios).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant API as APIServer
  participant Mgr as Manager
  participant KSM as KeyspaceRGManager
  participant Store as Storage

  Client->>APIServer: POST /setKeyspaceServiceLimit(keyspaceID, limit)
  APIServer->>Mgr: SetKeyspaceServiceLimit(keyspaceID, limit)
  Mgr->>KSM: updateServiceLimit(limit, persistBasedOnWriteRole)
  alt writeRole AllowsMetadataWrite
    KSM->>Store: persist service limit
    Store-->>KSM: OK
  else metadata write disabled
    KSM-->>Mgr: errMetadataWriteDisabled
  end
  KSM-->>Mgr: result (ok / error)
  Mgr-->>APIServer: (nil / error)
  alt success
    APIServer-->>Client: 200 "Success!"
  else metadata write disabled
    APIServer-->>Client: 403 error message
  else other error
    APIServer-->>Client: 500 error message
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/XXL

Suggested reviewers

  • rleungx
  • ystaticy
  • lhy1024

Poem

🐰
A nibble, a hop, a tiny config tune,
Roles set the garden beneath the moon,
Metadata gates open or stay tight,
Limits now heed who may write,
I celebrate with a joyful rune! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'rm: introduce manager write roles and gates' directly and specifically summarizes the main change—adding write-role gating to the resource manager.
Description check ✅ Passed The description provides all required sections: issue number, detailed explanation of changes with commit message, and comprehensive checklist with test/code/release-note information.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
pkg/mcs/resourcemanager/server/write_role.go (1)

44-45: Prefer errors.Is over errors.ErrorEqual for sentinel error checks.

errors.ErrorEqual compares via Cause() and string matching, which can give false positives on unrelated errors with the same message. errors.Is traverses the standard Go wrap chain and is the recommended approach per project guidelines. If the sentinel gets wrapped with fmt.Errorf("…: %w", err), errors.Is will match correctly whereas ErrorEqual may not (it only unwraps via pingcap/errors.Cause, not the standard Unwrap chain).

Proposed fix
 // IsMetadataWriteDisabledError reports whether err is a metadata-write-disabled error.
 func IsMetadataWriteDisabledError(err error) bool {
-	return errors.ErrorEqual(err, errMetadataWriteDisabled)
+	return errors.Is(err, errMetadataWriteDisabled)
 }

As per coding guidelines: "Check errors with errors.Is/errors.As for sentinel errors".

pkg/mcs/resourcemanager/server/apis/v1/api.go (1)

352-360: Swagger annotations are missing @Failure 403 and @Failure 500.

The handler now returns 403 (metadata write disabled) and 500 (internal error), but only @Failure 400 is declared. Other handlers in this file (e.g., postResourceGroup at Line 174) document their 500 responses.

Proposed fix
 //	`@Success`	200				{string}	string	"Success!"
 //	`@Failure`	400				{string}	error
+//	`@Failure`	403				{string}	error
+//	`@Failure`	500				{string}	error
 //	`@Router`		/config/keyspace/service-limit/{keyspace_name} [post]

As per coding guidelines: "Update Go annotations when changing APIs; reference format: swaggo/swag" and "Swagger: make swagger-spec (SWAGGER=1) regenerates; keep annotations current".


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed do-not-merge/needs-triage-completed do-not-merge/needs-linked-issue labels Feb 6, 2026
@codecov
Copy link

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.66%. Comparing base (2139230) to head (ed811db).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10227      +/-   ##
==========================================
+ Coverage   78.63%   78.66%   +0.03%     
==========================================
  Files         520      521       +1     
  Lines       70089    70148      +59     
==========================================
+ Hits        55112    55183      +71     
- Misses      10989    10994       +5     
+ Partials     3988     3971      -17     
Flag Coverage Δ
unittests 78.66% <77.77%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@okJiang okJiang marked this pull request as ready for review February 6, 2026 14:31
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 6, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/mcs/resourcemanager/server/apis/v1/api.go`:
- Around line 378-381: The handler currently returns http.StatusBadRequest for
any error from s.manager.SetKeyspaceServiceLimit which is incorrect for
server-side role-gating (errMetadataWriteDisabled) and inconsistent with other
handlers; update the error handling in the SetKeyspaceServiceLimit call so that
if the returned error is the sentinel errMetadataWriteDisabled you respond with
http.StatusForbidden (403) and for other manager errors respond with
http.StatusInternalServerError (500) to match
postResourceGroup/putResourceGroup/deleteResourceGroup behavior; locate the call
to s.manager.SetKeyspaceServiceLimit and adjust the conditional to inspect the
error value and map it to 403 for errMetadataWriteDisabled, otherwise return 500
with the error message.
🧹 Nitpick comments (4)
pkg/mcs/resourcemanager/server/write_role.go (1)

41-41: Nit: consider adding errStateWriteDisabled for symmetry.

Currently, state write gating silently skips persistence with no error propagation. This is fine if by design state writes should fail silently, but if an errStateWriteDisabled sentinel is ever needed for future APIs or diagnostics, it could be added here. Not blocking.

pkg/mcs/resourcemanager/server/service_limit.go (1)

51-60: Consider using a required parameter instead of variadic for writeRole.

The variadic writeRoles ...ResourceGroupWriteRole silently discards all but the first element if multiple are passed. All current call sites pass exactly one value. A required parameter would be clearer and avoid this ambiguity.

Suggested signature
-func newServiceLimiter(
-	keyspaceID uint32,
-	serviceLimit float64,
-	storage endpoint.ResourceGroupStorage,
-	writeRoles ...ResourceGroupWriteRole,
-) *serviceLimiter {
-	writeRole := ResourceGroupWriteRoleLegacyAll
-	if len(writeRoles) > 0 {
-		writeRole = writeRoles[0]
-	}
+func newServiceLimiter(
+	keyspaceID uint32,
+	serviceLimit float64,
+	storage endpoint.ResourceGroupStorage,
+	writeRole ResourceGroupWriteRole,
+) *serviceLimiter {
pkg/mcs/resourcemanager/server/keyspace_manager.go (2)

77-93: Same variadic concern as newServiceLimiter.

Same suggestion as in service_limit.go — consider a required parameter for writeRole since all call sites pass exactly one value.


195-213: deleteResourceGroup proceeds with in-memory deletion even when metadata write is skipped.

When AllowsMetadataWrite() is false (e.g., RMTokenOnly), Lines 205–209 skip the storage deletion but Lines 210–212 still remove the group from the in-memory map. On restart, the group would reappear from storage settings. This is safe in practice because Manager.DeleteResourceGroup already gates at Line 372, so this code path is unreachable with a non-metadata role. However, for defense-in-depth consistency, consider returning early (similar to persistResourceGroupRunningState) rather than silently diverging memory from storage.

@@ -0,0 +1,41 @@
// Copyright 2025 TiKV Project Authors.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Copyright 2025 TiKV Project Authors.
// Copyright 2026 TiKV Project Authors.

Signed-off-by: okjiang <819421878@qq.com>
@okJiang
Copy link
Member Author

okJiang commented Feb 10, 2026

/retest

1 similar comment
@okJiang
Copy link
Member Author

okJiang commented Feb 10, 2026

/retest

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 10, 2026

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-2 ed811db link true /test pull-unit-test-next-gen-2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@okJiang
Copy link
Member Author

okJiang commented Feb 11, 2026

/cc @rleungx @bufferflies

@ti-chi-bot ti-chi-bot bot requested review from bufferflies and rleungx February 11, 2026 08:13
Comment on lines +185 to +191
err := curGroup.PatchSettings(group)
if err != nil {
return err
}
if !krgm.writeRole.AllowsMetadataWrite() {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check AllowsMetadataWrite first and PatchSettings secondly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PatchSettings is an in-memory operation, I don't think it needs to be prohibited?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants